ABSTRACT


The Cultural Geography (CG) model, under development at TRAC Monterey, is
an open-source agent-based social simulation designed to offer insight into
the response of the civilian population during Irregular Warfare (IW) operations. It
implements social and behavioral science theories that govern the behaviors of
agents within the simulation using Bayesian belief networks.

At this stage, the agents within the CG model do not select their actions at 
all. Instead, all their actions are hard coded into the model’s scenario file. As part 
of an attempt to improve the model, this effort sought to enhance the functionality 
within the model by exploring the use of utility functions and, more specifically, 
the concept of reinforcement learning. 

This study began with the development of a learning agent prototype. 
After the initial testing for its functionality, the code that was developed was 
inserted into the main CG model. Based on specially developed scenarios, and 
by employing a design of experiments methodology, we created experimental 
runs. By applying statistical analysis techniques, we showed that
reinforcement learning works properly inside the Social Network environment and 
produces the desired results. 

This study can be used as a starting point for the research of the effects of 
reinforcement learning in social modeling in general. 
I. A NEW APPROACH


A. INTRODUCTION 

Irregular Warfare (IW) is defined as “a violent struggle among state and
non-state actors for legitimacy and influence over the relevant population” (JP 3-0,
Chapter I, p. 6).

As the definition makes clear, the main focus of IW is the population. This
created the need for a model that could represent the target population in an
adequate and realistic way. The Cultural Geography (CG) model, under
development at TRAC Monterey, is an open-source agent-based social simulation
designed to offer insight into the response of the civilian population during
IW operations. It implements social and behavioral science theories that govern
the behaviors of agents within the simulation using Bayesian belief networks.

B. PROBLEM STATEMENT 

At this stage, the agents within the model do not select their actions at all. 
Instead, all their actions are hard coded into the model’s scenario file. Although 
this approach allows the user to explore the impact of a certain sequence of 
events on the population that is being studied, it detracts from the realism of the 
scenario execution and also does not allow the actors within the model to 
explore, and potentially find, the courses of action that could be more useful for 
their purposes. 

As part of an attempt to improve the model, this effort will seek to enhance 
the functionality within the model by exploring the use of utility-based action 
selection, closely related to reinforcement learning. 
C. RESEARCH QUESTIONS


The main questions that this study will try to answer are: 

• Is reinforcement learning appropriate for use in social simulations 
and the CG model in particular? 

• What are the advantages of using utility-based agents within the 
CG model? 

D. BENEFITS OF THE STUDY 

This study is expected to directly affect the functionality of the CG model
by enhancing its capabilities and increasing its realism. The new functionality
is expected to let the user explore a greater number of possible action
sequences in an attempt to determine the sequence that yields the best results
for the user’s purposes. It will also help improve the insights that the model
provides about the target population and, as a result, arm decision makers with
richer and more realistic information about their operational environment.

The scope of this study is not limited to the CG model. Its concepts, and
the results that stem from it, should be considered as applicable within the realm 
of Social Networking in general. Future research could look into applying the 
concept of reinforcement learning to other Social Models. 

E. METHODOLOGY 

The study will begin with the development of an agent prototype that will 
use utility-based reinforcement learning as its driving force. After the initial testing 
of the functionality of this prototype, the concept will be applied to all agents 
within the model. A scenario will be designed, with special focus on 
infrastructure. Based on this scenario, we will design an experiment, varying 
certain parameters that affect the reinforcement learning process. A statistical 
analysis of the results will attempt to illustrate the agents’ proper functionality and 
answer the research questions stated above. 
F. WHAT COMES NEXT

Chapter II will lay the cognitive foundation for this study by explaining the 
concept of reinforcement learning. Chapter III will illustrate the process that was 
followed for the creation of the learning agent prototype. It will also present the 
results of the initial test run that was performed to test it. Chapter IV will introduce 
the CG model and its components. It will also describe the experiment that took 
place to validate the new agent’s functionality and the results of the statistical 
analysis that followed. Finally, in Chapter V, the study is concluded by presenting 
a brief discussion of the analysis results and by providing suggestions for 
possible future work on this area of the CG model. 
II. LOOKING INTO THE PAST


A. INTRODUCTION 

The purpose of this chapter is to give the reader a quick overview of
the concept of reinforcement learning, on which the utility-based agent for the
CG model is based. We begin with an overview, in simple terms, of how
reinforcement learning works, and we finish with an argument about the
applicability of reinforcement learning to social modeling in general and the CG
model in particular.

B. REINFORCEMENT LEARNING 

The term “reinforcement learning” describes a concept in which an agent
uses certain techniques to understand his environment, through the percepts that
he receives from it and the rewards he has received for his past actions, and
then decides on his next course of action in order to maximize his estimated
future reward (Russell & Norvig, 2003). Based on this simple
explanation of reinforcement learning, we can identify the two main building
blocks of this concept: the rewards policy and the decision process. In the following
subsections, we will explain how these two blocks work together to produce
an intelligent agent.

1. Useful Terms 

Before we begin our discussion on rewards policy and decision process, it 
is deemed necessary to clarify some terms that will be used in this chapter and 
throughout the study. This will help the reader understand the concept that these 
terms are being used to describe. 
a. Percepts


Every agent uses the mechanisms available to him (sensors,
software, etc.) to assess the state of his environment at every moment. The
results of this assessment are called percepts. In the case of an agent
playing chess, for example, a percept could be the current state of the chess board. The set of
percepts at a given point in time forms the state of the environment the
agent exists in. In reinforcement learning, an agent keeps track of the sequence
of states he has been in, in order to form a better understanding of his
environment and develop a strategy. The development of this strategy will be
discussed in the next subsection (Russell & Norvig, 2003).

b. Utility 

A distinction must be made between “point utility” and “utility per 
unit time.” We can define “point utility” as the reward that is received at a specific 
point in time, as a result of an action. “Utility per unit time” is the reward that is 
received over a unit of time. In this study, whenever we refer to utility, we assume 
“point utility”. 

2. Rewards Policy 

A widely accepted hypothesis is that all humans, whether they realize it or 
not, plan a great part of their actions based on the promise of a reward. This 
reward can be material (money, promotion, etc.) or immaterial (self happiness, 
inner peace, etc.). It can be argued that an agent inside a simulation cannot act 
based on immaterial rewards, since he cannot feel (although some people might 
disagree with that point, claiming that feelings could somehow be modeled). 
Therefore, if we accept that an agent cannot feel, the only way we can reward 
such an agent is through material means. This reward is the result of a utility 
function, and is represented by a numeric value. A utility function is defined by 
the agent’s creator and drives the agent’s actions throughout the duration of the 
simulation. Each time an agent acts, a reward is given according to the results of
the action. The reward acts as feedback to the agent about the actions he is
choosing.

According to Russell and Norvig (2003, p. 51) “a utility function maps a 
state (or a sequence of states) onto a real number, which describes the 
associated degree of happiness.” The approach of this study is slightly different. 
Instead of being attributed to states, the credit for rewards is assigned to actions, 
thus making the reinforcement learning process independent of the various 
states the agent might find himself in. 

In general, an agent tries to select actions that provide the greatest 
expected reward based on the discounted reward attributable to that action. To 
do that, the agent must keep track of all his past actions and the rewards 
associated with them. Whenever the agent must make a new choice, he must 
revisit his past actions and see what happened. By doing that, he calculates the 
expected utility (expected reward) for each of his candidate actions. 

To better illustrate this procedure, we will use a simple example, as shown 
in Figure 1. 


Figure 1. Sample firing of actions and associated rewards (plot of reward over time)
Let us assume that an agent, at the present point in time (time 20), has to
choose between two candidate actions, A and B. The agent, as mentioned 
above, must calculate the expected utility for each of these two actions. Let us 
start with action A. The agent looks into his past actions (Figure 1) and locates 
the points in time when action A was executed (fired). Action A was fired two 
times, at time points 0 and 10. For each of these firings, we calculate the total 
discounted utility that was received after that specific firing time. The formula that 
is used for this calculation is: 


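$$U(t) = \sum_{i=1}^{k} r_i \, \lambda^{(t_i - t)}$$

In that formula, $U(t)$ is the total utility for the action that was taken at time $t$, $k$
is the number of rewards that were awarded after time $t$, $r_i$ is the reward that was
awarded at time $t_i$, and $\lambda$ is the discount factor. In the case we are examining, for
the first firing of action A (at time 0), the total utility is:

$$U(0) = 8\,\lambda^{t_1} + 3\,\lambda^{t_2} + 5\,\lambda^{t_3}$$

where $t_1$, $t_2$, and $t_3$ are the times at which the rewards 8, 3, and 5 were awarded (Figure 1).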

We should note that, in the above calculation, we used all utilities that 
were awarded after time 1, no matter the action with which they were associated. 

The same procedure is repeated for the firing of action A that took place at 
time 10. For that firing, the total utility is: 

The expected utility for an action is given by the following formula:

$$\bar{U} = \frac{1}{n} \sum_{t \in F} U(t)$$

In that formula, $\bar{U}$ is the expected utility for the action, $F$ is the set of
times at which the action was fired, and $n$ is the number of firings for that
action. In our example, the expected utility for action A is:

$$\bar{U} = \frac{1}{2}\left[U(0) + U(10)\right]$$

The discount factor (lambda) is user defined and takes values
between 0 and 1. Its meaning is the following: a discount
factor closer to 0 drives the agent to maximize his short-term reward by giving
little weight to future rewards. In contrast, a discount factor closer to 1 drives
the agent to maximize his long-term reward. When lambda takes the value of 1,
the agent considers all his future rewards in an additive way (adds them all as
they are, without any discounting).
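
The calculation described in this subsection can be sketched in a few lines. The sketch below is written in Python for brevity (the CG model itself is written in Java) and is not the model's code. It assumes that each reward is discounted by the discount factor raised to the time elapsed since the firing. The function names and the reward history are hypothetical and do not reproduce the data of Figure 1.

```python
def total_discounted_utility(fire_time, rewards, lam):
    """Sum every reward awarded after fire_time, discounted by lam ** elapsed time.

    rewards is a list of (time, value) pairs, regardless of which action earned them.
    """
    return sum(value * lam ** (t - fire_time)
               for t, value in rewards if t > fire_time)


def expected_utility(firing_times, rewards, lam):
    """Average the total discounted utility over all firings of one action."""
    if not firing_times:
        return 0.0
    totals = [total_discounted_utility(t, rewards, lam) for t in firing_times]
    return sum(totals) / len(totals)


# Hypothetical history: action A fired at times 0 and 10; rewards are (time, value).
rewards = [(2, 8.0), (6, 3.0), (12, 5.0)]
firings_a = [0, 10]

for lam in (0.5, 0.9, 1.0):
    print(lam, expected_utility(firings_a, rewards, lam))
```

With a discount factor of 1 the rewards are simply added, while smaller values shrink the contribution of rewards that arrive long after the firing.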

3. Decision Process 

After the calculation of the expected utility for each candidate action, the 
agent must decide which action he will actually perform. Two simple action 
selection methods will be discussed below: the ε-greedy action selection method
and the softmax action selection method. The information presented below is
explained in far more depth by Sutton and Barto (1998, pp. 27-31).

a. ε-greedy Method

In this method, the agent selects, most of the time, the action
with the highest expected utility, but sometimes, with a user-defined low
probability ε, he makes his selection uniformly at random among the
candidate actions, regardless of their expected utility. By
doing that, the agent is able to explore his candidate actions more fully and
perhaps eventually reach a better result. It has been shown (Sutton and
Barto, 1998, pp. 28-29) that the ε-greedy method produces better results than the
pure greedy method, in which the agent always chooses the action with the
highest expected utility.
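
As an illustration of this selection rule, the following Python sketch (not taken from the CG model) chooses between candidate actions given their expected utilities. The function name, the action labels, and the utility values are assumptions made for the example.

```python
import random


def epsilon_greedy(expected_utilities, epsilon, rng=random):
    """Pick an action name from a dict {action: expected utility}.

    With probability epsilon choose uniformly at random (exploration);
    otherwise choose the action with the highest expected utility (exploitation).
    """
    actions = list(expected_utilities)
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=expected_utilities.get)


# Hypothetical expected utilities for two candidate actions.
utilities = {"A": 4.2, "B": 1.3}
rng = random.Random(0)
picks = [epsilon_greedy(utilities, epsilon=0.1, rng=rng) for _ in range(1000)]
print(picks.count("A"), picks.count("B"))
```

Setting epsilon to 0 reduces the rule to the pure greedy method.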

b. Softmax Method 

In this method, all actions are assigned probabilities that
correspond to their expected utility, so the action with the highest expected
utility gets the highest probability, and so on. These probabilities are assigned by the
following formula, which represents the Boltzmann distribution:

$$P_i = \frac{e^{E_i/\tau}}{\sum_{j} e^{E_j/\tau}}$$

In that formula, $P_i$ is the probability assigned to action $i$, $E_i$ is the
expected utility of action $i$, and $\tau$ is a parameter called temperature. The
temperature takes values that are greater than zero. A higher value of
temperature makes the agent more adventurous in his decisions. This means
that the agent is more likely to choose an action with a lower expected utility. As the
temperature approaches 0, the agent becomes greedier and chooses the action
with the highest expected utility. It must be noted that this “softmax effect” can be
achieved in ways other than the Boltzmann distribution.
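
The effect of the temperature can be seen in a short Python sketch of the Boltzmann distribution. This is an illustrative sketch, not the CG model's Java code; the function names and utility values are hypothetical, and the subtraction of the maximum utility is a standard numerical safeguard added here, not something the text prescribes.

```python
import math
import random


def boltzmann_probabilities(expected_utilities, temperature):
    """Return {action: probability} under the Boltzmann (softmax) distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be greater than zero")
    # Subtracting the maximum keeps exp() from overflowing at low temperatures;
    # it does not change the resulting probabilities.
    top = max(expected_utilities.values())
    weights = {a: math.exp((e - top) / temperature)
               for a, e in expected_utilities.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def softmax_choice(expected_utilities, temperature, rng=random):
    """Sample one action according to its Boltzmann probability."""
    probs = boltzmann_probabilities(expected_utilities, temperature)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]


# Hypothetical expected utilities for three candidate actions.
utilities = {"A": 1.0, "B": 0.5, "C": 0.0}
for temperature in (0.1, 1.0, 10.0):
    print(temperature, boltzmann_probabilities(utilities, temperature))
```

At a low temperature nearly all of the probability goes to the best action; at a high temperature the three probabilities become almost equal.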

c. Brief Discussion 

The main difference between the two methods described above lies
in what the agent does when he is not choosing in a greedy way. In the ε-greedy
method, the agent chooses uniformly at random among the candidate
actions. In the softmax method, the agent does not choose in a uniform way, but
takes into account the probabilities that are assigned to each action and, by
extension, the expected utilities of the actions. It is clear that, to reach an optimal
result, the agent must balance exploration and exploitation in his action selection.
The right balance cannot be predetermined, since it is closely connected
to the nature of the tasks that an agent must perform. In this study, we will use
the softmax method for action selection, and the probabilities for the candidate
actions will be calculated by using the Boltzmann distribution formula, as
described above.

4. Other Considerations 

The application of reinforcement learning shows potential to impact the 
following areas of social simulation: 

• Realism: the agents behave in a way closer to the way an actual
human might behave.

• Flexibility: through a small number of parameters, the user can
change the way the agents behave and, by doing so, explore a
practically unlimited number of action sequences.

• Increased capabilities: observational data shows that the
scenario execution time becomes considerably lower, when
compared to the time it takes to run a scenario with hard-coded
actions, thus allowing the users to use more agents in their
scenarios and model more complex situations.

• Traceability for analysis: the users can collect data about
preferred action choices by the agents and provide an analysis of
the potential reasons and motives behind those choices.

In the case of the CG model, the transition from hard-coded actions to 
reinforcement learning enhanced the model’s capabilities without compromising 
any of the functionalities that the model provided in the past. The analysis that 
will be provided in Chapter IV can be considered as a verification step for the 
integration of utility-based action-selection code into the CG model. 
III. THE CREATION OF THE UTILITY-BASED AGENT


A. INTRODUCTION 

The purpose of this chapter is to give a detailed overview of the procedure 
that was followed for the creation of the utility-based agent inside the Cultural 
Geography (CG) model. We will show how the concept of reinforcement 
learning, as detailed in Chapter II, was implemented into the CG model code. 
The information presented in this chapter is intended to help the reader 
understand the reasoning behind the creation of the various new components for 
the CG model. To fully understand the code that was created, the reader must 
also be fairly familiar with Java programming. 

B. THE BASICS 

A utility-based agent is an agent that selects its actions by estimating the
utility each candidate action is expected to yield. The introduction of such an
agent in the CG model is a significant change since, thus far, the actions of all
agents inside the model were hard coded inside the Excel configuration file. For
the creation of our agent, we based our design on utility theory. We used a
utility discount method, and the final action selection was based on the
Boltzmann distribution. The theoretical background behind these techniques was
covered in Chapter II, while the application of these techniques will be
described in more detail in subsequent sections of this chapter.

We decided to build our template for application on the insurgents
component of the CG model. By doing that, we intended to replace any hard-coded
insurgent actions in the scenario. From that point on, the insurgents
would perform actions (or do nothing) according to our utility-based procedure.
Initially, we created one such agent for the whole model.
Since we decided to use the insurgents component as our “guinea pig,”
we needed to define a utility in accordance with the insurgents’ needs and 
interests. We concluded that the average change in the population’s stance on 
the issue of security would be a good enough measure of utility for our agent. 
Since our agent, being an insurgent, would like to decrease the notion of security 
inside the population, it makes sense to assume that a decrease in the 
population’s stance on security should result in a positive utility value for our 
agent. 

Five new classes were created in the CG model to support the new agent: 

ActionEnergy - This class is used to create ActionEnergy objects. 
These objects are utilized to store the expected utility values for each agent and 
each action. The list is updated with each activation of the utility agent. 

AgentAction - This class is used to create AgentAction objects. 
These objects are utilized to store all candidate actions per agent. The list is 
created anew with each activation of the utility agent. 

FiringTime - This class is used to create FiringTime objects.
These objects are utilized to store, per agent, the times at which each action was “fired”
(chosen and put into effect). As a new action fires, the list is updated.

PointUtility - This class is used to create PointUtility objects. 
These objects are utilized to store the point utility values that are awarded to 
each agent at various times during the simulation run. These values are 
discounted appropriately before storage. 

SimpleUtilityAgentUmpire - This is the central class of the utility 
agent. All calculations and choice actions happen here. In this class, the 
following events take place: 

The collection of data 

The evaluation of data 

The scheduling of the chosen action 
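
To make the role of the four record classes more concrete, the following Python sketch models them as simple data holders. It is only an illustration of the bookkeeping described above: the actual classes are part of the Java CG model, and every field name shown here is an assumption, since the text does not list the fields.

```python
from dataclasses import dataclass


@dataclass
class ActionEnergy:
    """Expected utility currently held for one (agent, action) pair."""
    agent: str
    action: str
    expected_utility: float


@dataclass
class AgentAction:
    """One candidate action available to an agent."""
    agent: str
    action: str


@dataclass
class FiringTime:
    """Simulation time at which an agent fired an action."""
    agent: str
    action: str
    time: float


@dataclass
class PointUtility:
    """Point utility awarded to an agent at a given simulation time."""
    agent: str
    time: float
    value: float


# Hypothetical bookkeeping for a single agent named "insurgent".
firing_times = [FiringTime("insurgent", "DoNothing", 0.0)]
point_utilities = [PointUtility("insurgent", 1.0, 0.02)]
firing_times.append(FiringTime("insurgent", "DoNothing", 5.0))  # updated as actions fire
print(len(firing_times), point_utilities[0].value)
```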
Figure 2. The classes of the utility-based agent and their interconnections
(diagram boxes include Action Energy, Agent Percept, and Agent Action)

1. Collection of Data 

During this procedure, we collect the population’s average stance on 
security by going through each agent in the model that represents a population 
segment and adding his stance on security to a grand total, which we average in 
the end. Having already stored the average value from our previous collection 
event, we can easily calculate the change in the average stance on security by 
subtracting the current average from the previous average. The result of this
calculation will be used as our “point utility” for this particular time in the
simulation. A positive value of the point utility means that the average stance on
security has decreased. This is a good result for our agent.
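
The collection step can be sketched as follows. This Python fragment is an illustrative sketch rather than the model's Java code; the function names and the stance values are hypothetical.

```python
def average_stance(stances):
    """Mean stance on security over all population-segment agents."""
    return sum(stances) / len(stances)


def point_utility(previous_average, current_stances):
    """Return (point utility, new average) for one collection event.

    The utility is the previous average minus the current average, so a drop
    in the population's stance on security yields a positive value.
    """
    current_average = average_stance(current_stances)
    return previous_average - current_average, current_average


# Hypothetical stance values for four population agents at two collection events.
previous = average_stance([0.60, 0.50, 0.70, 0.60])
utility, previous = point_utility(previous, [0.55, 0.45, 0.65, 0.55])
print(round(utility, 3))
```

Returning the new average alongside the utility lets the caller store it as the "previous" value for the next collection event.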

The next step is to discount this utility value in the appropriate way. The
utility value is discounted according to when each particular action was used in
the past by the particular agent. The final discounted value of utility, called
Expected Utility, is stored in a list of ActionEnergy objects. This list contains
the expected utility per action and per agent, and it is the final output of
this procedure.

2. Evaluation of Data 

During this procedure, the agent chooses between his six candidate 
actions according to the probability distribution specified by the Boltzmann 
distribution. The procedure is presented in more detail in Chapter II. The action 
that is selected, as a result of this procedure, is scheduled for firing with no time 
delay. Finally, the firing times list is updated with the new action that is being 
fired, and the point utility list for the particular agent is also updated by adding the 
action that was chosen along with the raw point utility value that was calculated 
during the collection of data. This procedure is repeated for each utility-based 
agent in the simulation. 

3. Scheduling of Chosen Action 


During this procedure, the chosen action is scheduled for execution in the 
simulation. 



Figure 3. The sequence of events inside the main class of the utility-based agent
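
The three procedures can be combined into a single activation cycle. The Python sketch below is a simplified illustration and not the SimpleUtilityAgentUmpire class itself: the class name, its methods, and the stance readings are hypothetical, a single percept (average stance on security) is assumed, and the point utility is recorded during collection for simplicity.

```python
import math
import random


class UtilityAgentSketch:
    """Collect a point utility, evaluate candidate actions, schedule one of them."""

    def __init__(self, actions, lam, temperature, seed=0):
        self.actions = list(actions)
        self.lam = lam                    # discount factor, between 0 and 1
        self.temperature = temperature    # Boltzmann temperature, > 0
        self.firing_times = {a: [] for a in self.actions}
        self.point_utilities = []         # (time, raw point utility)
        self.previous_average = None
        self.rng = random.Random(seed)

    def collect(self, now, stances):
        """Record the change in average stance since the previous activation."""
        current = sum(stances) / len(stances)
        if self.previous_average is not None:
            self.point_utilities.append((now, self.previous_average - current))
        self.previous_average = current

    def expected_utility(self, action):
        """Average, over this action's firings, of the discounted later rewards."""
        firings = self.firing_times[action]
        if not firings:
            return 0.0
        totals = [sum(u * self.lam ** (t - f)
                      for t, u in self.point_utilities if t > f)
                  for f in firings]
        return sum(totals) / len(totals)

    def evaluate(self):
        """Sample an action from the Boltzmann distribution over expected utilities."""
        energies = [self.expected_utility(a) for a in self.actions]
        top = max(energies)
        weights = [math.exp((e - top) / self.temperature) for e in energies]
        return self.rng.choices(self.actions, weights=weights, k=1)[0]

    def activate(self, now, stances):
        """One full cycle; returns the action scheduled to fire with no delay."""
        self.collect(now, stances)
        action = self.evaluate()
        self.firing_times[action].append(now)
        return action


agent = UtilityAgentSketch(["DamageInfrastructure", "DoNothing"],
                           lam=0.9, temperature=0.5)
# Hypothetical stance readings for three population agents at times 0, 1, 2.
readings = [[0.6, 0.6, 0.6], [0.5, 0.6, 0.55], [0.5, 0.5, 0.5]]
fired = [agent.activate(now, stances) for now, stances in enumerate(readings)]
print(fired)
```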
C. ADDITIONAL NEW COMPONENTS


The code for the utility-based agent became part of the CG model in 
version 0.7.1. The Excel configuration file was modified to reflect the changes in
the code and to help the scenario builder create scenarios. A new tab
was added to the Excel configuration file. Inside this tab, the user can enter all 
the information that the utility-based agent needs in order to function properly. 




Figure 4. Snapshot of the new tab that supports the utility-based agent 


Another new component is a logger called Action Activation Data Logger. 
This logger records the activation levels for all candidate actions each time the 
utility-based agent repeats its choosing procedure. 

D. THE TEST RUN 

To test the functionality of our new agent, we created a “blank scenario.” 
In this “blank scenario,” no other actor of the CG model is performing any 
actions, except the new utility-based agent. The possible action choices for our 
agent, independent from the state of the environment, are as follows: 
• AntiCoalitionForceMessage: The insurgents spread a rumor against
the coalition forces among the population.

• AttackCoalitionForce: The insurgents conduct an attack on the
coalition forces.

• CivilianCasualty: The insurgents attack a key civilian.

• DamageInfrastructure: The insurgents attack and cause damage
to an infrastructure facility.

• DoNothing: Self-explanatory.

We also prevented any consumption of consumables from all actors. In 
this way we eliminated any effects that the consumption might have on the 
population. Essentially, all infrastructure remained in a neutral state
throughout our “blank scenario.” We performed two test runs, using two extreme
values of temperature (0.1 and 1.0). During the execution of each test run, we 
recorded the activation levels for each candidate action. After gathering all data, 
we constructed two overlay plots, one for each value of temperature. 
Figure 5. Overlay plot for temperature = 0.1

Figure 6. Overlay plot for temperature = 1.0


From these plots, we can see a difference in the way our agent chooses 
his actions between the two cases. 

In the case with the lower temperature (greedy case), the agent is 
supposed to choose the action with the highest activation level. We can clearly 
see (Figure 5) that the utility value for most actions stays at a relatively high level 
until a simulation time of about 35. The agent does not care which action he 
chooses, as long as this action produces the best expected result for him. 
Eventually, all actions’ activation levels grow smaller with time and they converge 
at a value right above zero. This can be explained by the fact that in our “blank” 
scenario no other actor does anything except our agents. Therefore, there are no 
other actions taking place that could influence the percepts our agent uses to 
calculate his utility value. With that in mind, it makes sense that the agent will 
collect the greatest amount of utility in the beginning of the simulation and, since 
nothing changes in his environment as the simulation progresses, his potential 
for any further improvements in his utility will gradually decrease until it reaches a 
value near zero. And this will eventually happen no matter which sequence of 
actions he might choose. 
In the case with the higher temperature (exploration case), the agent
might not necessarily choose the action with the highest activation level. This
could lead, eventually, to a quicker convergence of the activation levels of all
actions to a value close to zero (Figure 6). This can be explained by the fact that,
in the exploration case, the agent does not favor one action over all others, thus
leading to a more balanced choice between actions.

Since there are no other actions being performed by the other actors of 
the CG model in this “blank scenario,” we observe the odd phenomenon of the 
steady decrease of the activation level for all actions. This absence of actions by 
all other actors in the model creates a “sterile” environment for our agent, thus 
not allowing us to draw any useful conclusions about the factors that contribute to 
the way our agent chooses his actions. What is necessary, in order to gain more 
insight on the way our agent works, is to put him in a fully active environment 
with other actors performing their own actions. We will examine such a scenario 
in Chapter IV. This test run was only supposed to test the functionality of our new 
agent and to give us a taste of the effects of temperature. In that regard, our test 
run was successful. 

E. THE EVOLUTION 

After the inclusion of the utility-based agent in the CG model, we tried to 
expand its applicability. We soon realized that, even though the creation of the 
agent was based on the insurgents component of the CG model, we could easily 
expand its use to the other components as well. All the scenario builder had to do 
was to identify the issue and position of interest, according to the agent he was 
describing, and enter the appropriate values in the Excel configuration file, as 
shown above (Figure 4). 

Although the use of one percept per agent for the calculation of the utility 
value was appropriate, as a first step, the reality is that an agent might need 
more than one percept from its environment in order to formulate its utility value 
and, eventually, determine its next actions. For that purpose, a new class was
created. This new class allows the agent to use multiple percepts (if needed) for 
the calculation of its utility value: 

AgentPercept - This class is used to create AgentPercept objects. 
These objects are utilized to store a list of percepts for each agent prototype. 

The final utility value is the weighted average of all the percepts of the 
agent. The weight of each percept is provided by the user, thus making the utility- 
based agent essentially user driven. 
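
A minimal Python sketch of this weighted average is given below. It is not the model's Java code; the function name, the percept values, and the weights are hypothetical.

```python
def weighted_utility(percepts):
    """Combine several percepts into one utility value.

    percepts is a list of (value, weight) pairs; weights are user supplied
    and must not sum to zero.
    """
    total_weight = sum(weight for _, weight in percepts)
    if total_weight == 0:
        raise ValueError("percept weights must not sum to zero")
    return sum(value * weight for value, weight in percepts) / total_weight


# Hypothetical percepts: changes in stance on security, elections, infrastructure.
percepts = [(0.05, 2.0), (-0.01, 1.0), (0.02, 1.0)]
print(weighted_utility(percepts))
```

Doubling the weight of the security percept, as in this example, makes the agent twice as sensitive to changes in that stance as to the other two.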

It must be noted that, at this point, the use of multiple percepts is allowed 
only per agent prototype. This means that, if we allow the use of three percepts 
for the calculation of the utility of a certain agent prototype, all agents based on 
this prototype will use these three percepts in their calculations. There is no way 
that an agent of this particular prototype will use any other percepts. 
IV. PROVING A POINT


A. INTRODUCTION 

This chapter presents the effort that was conducted toward the verification 
of the new Cultural Geography (CG) model’s functionality. We will begin with a 
brief description of the CG model and its components. Then we will provide a 
detailed account of the experiments that were conducted for the verification of 
the CG model’s functionality. Finally, we will present the results of our analysis 
with the intention to prove that the agents inside the model behave in an 
acceptable way, thus answering the first of the two research questions posed in 
Chapter I (“Is reinforcement learning appropriate for use in social simulations and 
the CG model in particular?”). The second research question (“What are the
advantages of using utility-based agents within the CG model?”) will be
answered in Chapter V, during the discussion of the analysis results.

B. THE CULTURAL GEOGRAPHY MODEL 

TRAC Monterey developed the CG model in Java, using Simkit as the 
simulation engine. Alt, Jackson, Hudak and Lieberman (2009) described the CG 
model as a “re-usable framework for representation of the civilian population 
within an IW context” (p. 2). The reusability of the model is achieved through a 
modular approach in its design. The model attempts to represent Kilcullen’s
concept of a “conflict ecosystem” that exists in an IW setting by utilizing a multi-agent
system (MAS), as defined by J. Ferber (as cited in Valdez, 2009, p. 9).

The basic components of the CG model are the following: 

• Population 

• Infrastructure 

• Other actors 
Figure 7. The Cultural Geography model (From TRAC Monterey)


All agents within the model are part of either the population or the “Other 
Actors” components shown in Figure 7. The infrastructure component is not 
represented by agents. 

The primary forces that govern an agent’s behavior within the CG model 
are the narrative paradigm and the theory of planned behavior (Alt et al., 2009). 
A brief description of these two concepts is provided, so the reader can form a
general image of how the model works. We also talk briefly about the
infrastructure component of the model, describing the basic ideas behind this
component.

1. Narrative Paradigm 

The narrative paradigm is a theory developed by Fisher (1987). According 
to that theory, a narrative paradigm is “the incorporation of an entity’s beliefs, 
values, and interests into a story, through which an agent evaluates the other 
stories of the world” (Valdez, 2009, p. 12). Alt et al. (2009) describe the 
procedure that is utilized to construct an agent’s narrative identity. The population 
in the area of interest is partitioned into entity types, according to relevant socio- 
demographic lines. These socio-demographic lines can be identified through the 
opinions of subject matter experts or polling data. For each entity type, a 
narrative identity is developed. Through a Bayesian network, these collections of 
beliefs and values are linked to the agent’s stance on the issues of security, 
elections and infrastructure, thus constructing a Bayesian belief network (TRAC 
Monterey, 2009, p. 81). Each agent prototype has its own belief network that sets 
it apart from other agent prototypes. The Bayesian network that is used for 
infrastructure identification is shown in Figure 8: 



Figure 8. Bayesian network for Infrastructure (TRAC Monterey, 2009) 

In this example, the indication “PROISAF” and “ANTIISAF” should be 
considered as a “YES” or “NO,” respectively. For example, a value of “PROISAF” 
in the TOLERATE_OPIUM behavior means that the agent tolerates opium 
trafficking. A value of “ANTIISAF” in SUPPORT_US means that the agent does 
not support the U.S. 

Whenever an event happens that affects an agent prototype’s belief 
network, the values inside the network are updated accordingly. This results in 
an updated stance on security, elections and/or infrastructure. In addition, a 
“Homophily network” that represents “the likelihood of communication between 
two individuals in terms of their similarity among social factors” (Alt et al., 2009, p. 
10) can also influence an agent’s beliefs and, eventually, his stance on the 
above-mentioned issues. 

2. Theory of Planned Behavior 

The theory of planned behavior is the main force behind an agent’s 
actions. As described by Alt et al., “individuals within a group will form an 
intention to adopt a behavior based on: 1) their attitude toward the behavior, 2) 
their perception of the group norms associated with that behavior, and 3) the 
individual’s perceived level of behavioral control in regard to that behavior” (2009, 
p. 5). The theory was implemented inside the CG model by utilizing a Bayesian 
network. An example of such a network is shown in the following figure. The work 
detailed in this thesis is designed to address deficiencies in the current 
implementation of the theory of planned behavior using Bayesian networks. 
[Figure 9 image not reproduced. Recoverable node labels: “Is the agent in control 
of the behavior?”; CB Do Nothing (Success 49.5, Failure 50.5); Control (Low 38.7, 
High 61.3). Legend: NB = Normative Belief, BB = Behavioral Belief, CB = Control 
Belief.] 

Figure 9. Theory of planned behavior network (From TRAC Monterey, 2009) 


3. Infrastructure Component 

Infrastructure is represented inside the model as multi-server queues. The 
agents can consume goods (food, water, etc.) or receive services (electricity, 
irrigation, etc.). The agent’s interaction with the infrastructure objects, combined 
with his narrative identity, can result in a change in the agent’s stance on the 
issue of infrastructure. As is usually the case with multiple-server queues, we can 
establish limits to the server’s capabilities. For example, the availability of a 
service can be increased, due to actions of friendly forces (construction), or 
decreased, due to the actions of enemy forces (attacks). In addition, a cost can 
be associated with the server’s usage. For example, if a resource is located in 
region A, an agent from region B can use this resource but he will have to pay a 
cost, whereas an agent located in region A will not have to pay this cost. 
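
The queueing behavior described above can be illustrated with a minimal sketch. 
It is not the CG model's Java implementation: the class name, the field names and 
the first-come-first-served waiting rule are assumptions made only for 
illustration. 

```python
from dataclasses import dataclass, field


@dataclass
class InfrastructureServer:
    """Hypothetical multi-server resource (for example, a well) located in one region."""
    region: str
    capacity: int                 # number of parallel servers currently available
    out_of_region_cost: float     # cost paid by agents from any other region
    busy: int = 0
    waiting: list = field(default_factory=list)

    def _cost(self, agent_region):
        # Agents located in the server's own region pay nothing.
        return 0.0 if agent_region == self.region else self.out_of_region_cost

    def request(self, agent_id, agent_region):
        """Return (served_now, cost_paid); agents wait when every server is busy."""
        if self.busy < self.capacity:
            self.busy += 1
            return True, self._cost(agent_region)
        self.waiting.append((agent_id, agent_region))
        return False, 0.0

    def release(self):
        """Free one server and, if possible, serve the next waiting agent.

        Returns (agent_id, cost_paid) for the agent now served, or None.
        """
        self.busy = max(0, self.busy - 1)
        if self.waiting and self.busy < self.capacity:
            agent_id, agent_region = self.waiting.pop(0)
            self.busy += 1
            return agent_id, self._cost(agent_region)
        return None

    def change_capacity(self, delta):
        """Construction (delta > 0) or attacks (delta < 0) change availability."""
        self.capacity = max(0, self.capacity + delta)
```

In this sketch, an agent from region B that requests a resource located in region 
A is charged `out_of_region_cost`, while an agent from region A is served at no 
cost, mirroring the example in the text. 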

C. THE EXPERIMENTS 

What follows is a description of the experimentation that took place with 
the CG model in order to monitor the agents’ behavior and draw results. Each 
experiment is presented in the following way: a brief scenario description, the 
independent variables (the factors that were manipulated), the dependent 
variables (the measures of effectiveness that were measured), constraints, 
limitations and assumptions (as necessary), and a description of the results 
along with tables and diagrams as deemed appropriate. All of the analysis was 
done using the JMP analysis software. The scenarios were developed in 
collaboration with TRAC Monterey. 

1. About the Scenarios 

All the scenarios of this study deal with a province of Afghanistan called 
Kandahar. More specifically, the population of this region was segmented into 
factions according to the following characteristics: 

• Family/Clan status 

• Tribe 

• Disposition 

• Political affiliation 

• Age - Gender 

The specific values that were used for these characteristics are shown in 
the following table: 
Table 1. Demographic characteristics used for the segmentation of the population 

Family/Clan Status | Tribe                                             | Disposition | Political Affiliation | Age / Gender 
Inherited          | Empowered (Barakzai, Popalzai, Mohammadzai)       | Urban       | Anti-Government       | Military Age Male 
Achieved           | Passive (Abkozai, Noorzai)                        | Kuchi       | Neutral               | Elder Male 
Poor / Unemployed  | Marginalized (Noorzai, Ishaqzai, Alizai, Ghilzai) |             | Pro-Government        | Military Age Female 
                   |                                                   |             |                       | Elder Female 


Through consultation with experts in the region, the scenario building team 
examined all possible combinations of the demographic characteristics and 
narrowed the resulting population groups to 15. For each one of those groups, a 
narrative identity was created and was inserted in the model. 

All agents representing the population are also infrastructure consumers, 
following the general rules described in paragraph B3, above. 

For the purposes of our study, we will examine the population in an 
aggregate way, so there is no need to analyze the segmentation process in much 
more detail. 

2. Variables 

Independent Variables 

In all of the scenarios, we manipulated one or more of the following three 
factors: 

• Lambda: As previously described in Chapter II, this variable is the 
discount rate for percepts that will be received in the future. A 
discount rate closer to 0 drives the agent to maximize his short-term 
reward by giving little value to future rewards. From this 
point on, we will refer to a rate closer to 0 as “short-term memory.” 
In contrast, a discount rate closer to 1 drives the agent to maximize 
his long-term reward. We will refer to such a rate as “long-term 
memory.” 

• Temperature: As described in Chapter II, the value of temperature 
defines the “greediness” of the agent. A value closer to 0 drives the 
agent to make greedy decisions (exploit), whereas a high 
temperature value allows more diversity in the agent’s decision 
making (explore). 

• Collection interval: This variable is the number of simulation 
time steps that elapse between two decisions of an agent. 
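
The roles of temperature and lambda can be illustrated with a generic sketch of 
softmax (Boltzmann) action selection and discounted rewards. The exact formulas 
of Chapter II are not reproduced in this chapter, so the function names and the 
discounting form used here are assumptions, not the CG model's code. 

```python
import math
import random


def softmax_probabilities(utilities, temperature):
    """Boltzmann (softmax) selection probabilities for a list of utilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    peak = max(utilities)  # subtracting the maximum keeps exp() numerically stable
    weights = [math.exp((u - peak) / temperature) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def choose_action(actions, utilities, temperature, rng=random):
    """Draw one action at random according to the softmax probabilities."""
    probabilities = softmax_probabilities(utilities, temperature)
    return rng.choices(actions, weights=probabilities, k=1)[0]


def discounted_value(rewards, lam):
    """Sum of rewards, where a reward k steps ahead is weighted by lam**k (assumed form)."""
    return sum((lam ** k) * r for k, r in enumerate(rewards))
```

With a temperature near 0, almost all of the probability mass falls on the action 
with the highest utility (exploitation); with a large temperature, the 
probabilities approach a uniform distribution (exploration). With lambda near 0, 
`discounted_value` is dominated by the immediate reward, which corresponds to 
what the text calls “short-term memory.” 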

Dependent Variables 

For all scenarios, we measured the population’s stance on the issues of 
security, infrastructure and governance. Since we chose to examine the 
functionality of an insurgent agent, we used as our main measure of 
effectiveness the population’s stance on security. We made this decision 
because, according to all scenarios, the stance on security is the percept that the 
insurgent agent uses in order to calculate his utility values and, eventually, 
formulate his decisions. 

3. General Constraints, Limitations and Assumptions 


a. Constraints 

There is limited time in which to conduct this study, constrained by 
school requirements. 

b. Limitations 

• The population is represented by a total of 350 agents. 

• The other actors (Insurgents, Government of the Islamic Republic 
of Afghanistan (GIRoA), Afghan National Security Forces 
(ANSF), International Security Assistance Force (ISAF)) are 
represented by one agent per actor and per region, a total of 20 
agents. 

• A small sample of experts provided survey input regarding the 
impact of events/themes on population beliefs. 

c. Assumptions 

• 350 agents representing the major population groups within 
Kandahar Province provide sufficient fidelity to extract the population’s 
beliefs and stances on issues. 

• Fully vetted expert input, with multiple years of experience 
concerning Kandahar, adequately represents the impact of events 
on the population identity groups. 

4. Candidate Actions 

The focus of our study will be on the agents that represent the insurgents 
inside the model. There is one agent per region, bringing the total of the 
insurgent agents to five. Each agent must choose among the following actions: 

• DoNothing: Self-explanatory. The agent performs no action. 

• KillCivilServant: The agent makes an assassination attempt 
against a Civil Servant. 

• IED: The agent plants an IED against any target. 

• IED_ANSF: The agent plants an IED targeting the ANSF forces. 

5. The Experimental Design 

We conducted our research either by performing single runs of our 
scenarios or by doing a design of experiments based on these scenarios. In the 
cases in which we used a design of experiments, the method used was the Nearly 
Orthogonal Latin Hypercube (NOLH). This space-filling design allows us to explore 
the factor space efficiently while keeping the input factors approximately 
orthogonal. By 
using the NOLH method, the experimental design points form a representative 
subset of the hypercube of explanatory variables (Alt et al., 2009). The 
development of the design points was done by utilizing the Design of 
Experiments tool that was developed by TRAC Monterey. This tool creates 
design points based on the NOLH method and incorporates them into the 
scenario files that are used by the CG model. 
D. THE RESULTS 


In the following paragraphs, we will give a detailed account of the results 
of our statistical analysis. The results will be presented in such a way that they 
answer the research questions posed in Chapter I. All the general information 
detailed in section C above pertains to all scenarios. Any additional scenario- 
specific information is mentioned in each scenario paragraph as needed. 

1. Scenario 1 - Simple Run 


For this scenario, we controlled the collection interval variable by setting it 
to a value of 1 for every agent. Moreover, we gave every agent (except Tal1) a 
value of 0.01 for lambda and a value of 1 for temperature. For the agent Tal1 
(who represents an insurgent agent in the region of Kandahar City (KC)), we gave 
a value of 0.9 for lambda and a value of 0.1 for temperature. These settings were 
inserted into the Excel scenario setup file as shown in Figure 10: 
Figure 10. Setup of variables for all acting agents 

With these values, we expect the agent Tal1 to act in a greedy way and 
have a long-term memory. All other agents are expected to act in a more 
exploratory way and have a short-term memory. In addition, we gave 
the action KillCivilServant an enhanced reward value of 100 and kept the reward 
value of all other candidate actions at 1. This enhancement was done 
only for the agent Tal1. By doing that, we expected a greedy agent (like Tal1) to 
choose the action with the enhanced reward more often than all other candidate 
actions. The scenario ran for one replication. 


The distribution of the chosen actions for Tal1 was: 

Figure 11. Distribution of actions for agent Tal1 (scenario 1) 



The distribution of actions for agent Tal2 was: 

Figure 12. Distribution of actions for agent Tal2 (scenario 1) 


By examining these two figures, it is clear that the agent Tal1 favors the 
KillCivilServant action much more than Tal2 does. However, is this difference in 
action choices a significant one? To find out, we performed a contingency 
analysis of the action choices of the two agents. With an alpha level of 0.05, 
here are the results of this analysis: 

Table 2. Contingency table of the candidate actions by agent 

Tests 

N     DF    -LogLike     RSquare (U) 
364   3     5.0759665    0.0101 

Test                ChiSquare    Prob>ChiSq 
Likelihood Ratio    10.152       0.0173* 
Pearson             10.072       0.0180* 

Figure 13. ChiSquare test for the likelihood of action choice occurrence 

Both Figure 13 and Table 2 show that the difference between the action 
choices of Tal1 and Tal2 is significant at the 0.05 level: the probability that 
such a difference in action choice happens by pure chance is 0.0173 (likelihood 
ratio test) or 0.0180 (Pearson test). 
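
The figures reported in Figure 13 can be cross-checked with a few lines of code. 
The sketch below is not the JMP procedure used in the study; the function names 
are hypothetical. It relies on two standard facts: the likelihood-ratio 
chi-square equals twice the reported -LogLike, and the upper-tail probability of 
a chi-square variable with 3 degrees of freedom has a closed form. 

```python
import math


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def pearson_and_likelihood_ratio(table):
    """Pearson and likelihood-ratio (G) statistics for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    pearson = 0.0
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            pearson += (observed - expected) ** 2 / expected
            if observed > 0:                      # empty cells contribute nothing to G
                g += 2 * observed * math.log(observed / expected)
    return pearson, g


# Values reported in Figure 13 (2 agents x 4 actions, hence DF = 3).
neg_log_like = 5.0759665
print(round(2 * neg_log_like, 3))       # likelihood-ratio ChiSquare
print(round(chi2_sf_df3(10.152), 4))    # Prob>ChiSq, likelihood-ratio test
print(round(chi2_sf_df3(10.072), 4))    # Prob>ChiSq, Pearson test
```

Given the observed counts of Table 2 as a list of two rows (one per agent) with 
four counts each (one per action), `pearson_and_likelihood_ratio` would return 
the two ChiSquare statistics directly. 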

The next thing we wanted to examine was how the agent Tal1 made his 
choices over time. We already knew that he favored the KillCivilServant action 
over all others. We wanted to see how these choices were distributed over time 
and how they compared to the distribution of Tal2’s choices for this action. To 
accomplish that, we calculated, for each agent, the moving average of the number 
of times the action KillCivilServant was chosen, and we constructed 
an overlay plot to better illustrate any differences between agents Tal1 and Tal2. 
The plot looks like this: 

Figure 14. Moving averages of agents Tal1, Tal2 over time for action 
KillCivilServant (Scenario 1) 


For the plot shown above, a value of 0.7 for the moving average at time 25 
means that in the previous ten time steps (between time steps 16 and 25) the 
action KillCivilServant was chosen seven times out of ten. Note that the action is 
selected much less frequently as the run progresses. Since the utility is 
based on the change in population stance from one observation to the next, as 
the changes in population stance go down, the activation levels do as well. In the 
current model, the population’s stance tends to converge over time and each 
subsequent action has less ability to move the issue stance. This results in a 
reduced effectiveness for all action choices by the end of the scenario run. This 
is a known issue with the model’s representation of issue stance, and future work 
is planned to address this. 
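
The moving average described above can be computed as in the following sketch. 
The window of ten time steps comes from the text; the function name and the 0/1 
encoding of the choices are assumptions made for illustration. 

```python
def moving_average(chosen, window=10):
    """Fraction of the last `window` time steps in which the action was chosen.

    `chosen` holds one 0/1 flag per time step. Positions that have fewer than
    `window` observations behind them are reported as None.
    """
    averages = []
    running = 0
    for t, flag in enumerate(chosen):
        running += flag
        if t >= window:
            running -= chosen[t - window]   # drop the flag that left the window
        averages.append(running / window if t >= window - 1 else None)
    return averages
```

If the action was chosen in seven of the ten most recent time steps, the entry 
for the current time step is 0.7, which matches the interpretation given for the 
plot. 
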

So far, we have shown that the agent Tal1 makes choices properly inside the CG 
model. More specifically, by enhancing the reward value of one 
candidate action over the others for one particular agent, we showed that this 
agent favors this action in a significant way, compared to other insurgent agents 
operating in neighboring regions. This is a strong indication that our agent makes 
effective use of reinforcement learning when choosing his next 
action. Finally, we showed that this preference is spread throughout the duration 
of the simulation and does not occur only at the beginning or at the end of the 
simulation run. Note that any action chosen late in the run has a lower impact 
on the population, and therefore a lower utility, due to the 
convergence of the population’s issue stance over time. 

2. Scenario 1 - Experimental Runs 

For this run, we used the settings of the previous scenario. We designed 
an experiment by using the NOLH design for optimality. By varying the lambda 
and the temperature variables between the values of 0 and 1, we ended up with 
33 design points. The scenario ran for 10 replications. The total number of runs 
for our experiment was 330. The runs were performed in TRAC Monterey, using 
the computers located in the Conference Room lab. 

The design points and the respective values for lambda and temperature 
are shown in the following table: 
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Table 3. Design points for experimental design 

First, we examined our MOE (population stance on security for the 
Kandahar City location). We averaged our MOE across replications and design 
points. This is the plot of our averaged MOE over simulation time: 

Figure 15. Plot of the mean population stance on security over simulation time 

Since the purpose of our agent is to decrease the population’s satisfaction 
with security in the area over time, it appears as if our agent was successful. 

To get an idea of which combination of lambda and temperature produced 
the best results, we constructed a contour plot of our MOE over lambda 
and temperature. 

Figure 16. Contour plot of the population’s stance on security over lambda and 
temperature 


Since we were interested in the lowest values of our MOE, we focused our 
attention on the areas within the two circles. The combinations of lambda and 
temperature within these two circles produce the greatest decrease in the 
population’s satisfaction, the goal of our agent. By examining Table 3, we can 
determine that these design points are DP19, with lambda = 0.18 and 
temperature = 0.1, and DP7, with lambda = 0.72 and temperature = 0.52. We will 
compare these design points with DP5, which seems to produce the worst 
results, indicated in Figure 16 with a square. 

Figure 17. Overlay plot of the population’s stance on security over time by design 
point 


From Figure 17, we can see that among the design points in our contour 
plot, DP7 and DP19 (represented by the bottom two lines in the plot) produce 
better results in less time, compared to design point DP5 (represented by the top 
line in the plot), which also produces good results but much more slowly. 

3. Scenario 2 - Experimental Runs 

For the purposes of this experiment, we constructed a scenario that 
treated all candidate actions for all acting agents in a similar manner. This means 
that all candidate actions have the same weight. No action is favored over any 
other. In each region, we placed one agent per category (GIROA, ANSF, ISAF, 
TAL). We focused our study on the region of Kandahar City. The factors we 
examined for this scenario are the three factors described in 
paragraph C2 of this chapter: the collection interval, 
lambda and temperature. We varied the values of lambda and temperature 
from 0 to 1 and the value of the collection interval from 1 to 5. We did this for 
all acting agents in the model. By using the design of experiments tool built by 
TRAC Monterey, we constructed a design of experiments, based on the NOLH 
method, ending up with 512 design points. The simulation ran for five 
replications, bringing the total number of runs to 2,560. 

After averaging across design points, we performed a standard least 
squares regression, using the population’s stance on security as the dependent 
variable and our three factors per acting agent (a total of 12 factors) as the 
independent variables. In that way, we wanted to examine which of the above- 
mentioned factors contributes significantly to our MOE. Here are the results of 
our analysis at an alpha level of 0.05: 


Parameter Estimates 

Term                Estimate     Std Error    t Ratio    Prob>|t| 
Intercept           0.238073     0.002565      92.83     0.0000 * 
CollIntGIROADevI    0.0009983    0.000278       3.59     0.0003 * 
CollIntANSFDevI     -0.001794    0.000278      -6.46     <.0001 * 
CollIntISAFDevI     -0.011247    0.000278     -40.49     0.0000 * 
CollIntTALDevI      0.0074984    0.000278      27.00     <.0001 * 
LambdaGIROADevI     -0.000744    0.001235      -0.60     0.5470 
LambdaANSFDevI      0.0012615    0.001235       1.02     0.3072 
LambdaISAFDevI      -0.001114    0.001235      -0.90     0.3670 
LambdaTALDevI       -0.000328    0.001235      -0.27     0.7909 
TempGIROADevI       0.0001838    0.001235       0.15     0.8817 
TempANSFDevI        -0.001469    0.001236      -1.19     0.2344 
TempISAFDevI        0.000762     0.001235       0.62     0.5374 
TempTALDevI         -0.000405    0.001235      -0.33     0.7430 


Figure 18. Analysis results for scenario 2 experimental run 


Figure 18 shows clearly that the collection interval terms are the only factors 
that contribute significantly to our MOE at the 0.05 level; none of the lambda or 
temperature terms does. This happens for all acting agents, 
no matter the role they play in the simulation. 
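
The reading of Figure 18 can be reproduced from the reported numbers alone. The 
sketch below is not the JMP analysis; the function names are hypothetical, and a 
reported probability of “<.0001” is entered as its upper bound 0.0001. It lists 
the terms that are significant at a given alpha level and checks that each 
reported t ratio agrees with estimate / standard error up to rounding. 

```python
# (term, estimate, std error, reported t ratio, reported Prob>|t|)
ROWS = [
    ("Intercept",        0.238073,  0.002565,  92.83, 0.0000),
    ("CollIntGIROADevI", 0.0009983, 0.000278,   3.59, 0.0003),
    ("CollIntANSFDevI", -0.001794,  0.000278,  -6.46, 0.0001),
    ("CollIntISAFDevI", -0.011247,  0.000278, -40.49, 0.0000),
    ("CollIntTALDevI",   0.0074984, 0.000278,  27.00, 0.0001),
    ("LambdaGIROADevI", -0.000744,  0.001235,  -0.60, 0.5470),
    ("LambdaANSFDevI",   0.0012615, 0.001235,   1.02, 0.3072),
    ("LambdaISAFDevI",  -0.001114,  0.001235,  -0.90, 0.3670),
    ("LambdaTALDevI",   -0.000328,  0.001235,  -0.27, 0.7909),
    ("TempGIROADevI",    0.0001838, 0.001235,   0.15, 0.8817),
    ("TempANSFDevI",    -0.001469,  0.001236,  -1.19, 0.2344),
    ("TempISAFDevI",     0.000762,  0.001235,   0.62, 0.5374),
    ("TempTALDevI",     -0.000405,  0.001235,  -0.33, 0.7430),
]


def significant_terms(rows, alpha=0.05):
    """Terms whose reported Prob>|t| is below alpha."""
    return [term for term, _, _, _, p in rows if p < alpha]


def t_ratio_mismatches(rows, tolerance=0.05):
    """Terms whose reported t ratio differs from estimate / std error."""
    return [term for term, est, se, t, _ in rows if abs(est / se - t) > tolerance]


print(significant_terms(ROWS))
print(t_ratio_mismatches(ROWS))
```

The tolerance allows for the rounding of the standard errors in the printed 
table. 
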

Focusing our attention on the collection interval factor, we can examine 
which values of it produce the best results. For that purpose, we constructed 
plots of our MOE over each of the acting agents’ collection interval for the region 
we are examining (Kandahar City). 

Figure 19. Plots of the population’s stance on security over the acting agents’ 
collection intervals 

The plot shows that better results were produced when the collection 
interval had a high value. This means that when our agent allows more time 
between his action choices, the actions he chooses produce better results. 

The analysis performed on this scenario showed that the collection 
interval variable far outweighs lambda and temperature in significance when 
the three are examined together. This is the reason why, in our previous 
scenarios, we controlled the collection interval variable in order to isolate the 
effects of the lambda and temperature variables and examine how these variables 
affect the action choices of our agents and, by extension, our MOE. 

V. FINAL THOUGHTS AND A LOOK TO THE FUTURE 


A. INTRODUCTION 

The concluding chapter of this study will begin with a brief discussion of 
the major findings of our analysis results, in which we will attempt to show the 
benefits of this study. Finally, we will present our suggestions for future study on 
the application of reinforcement learning within the Cultural Geography (CG) 
model. 

B. DISCUSSION OF ANALYSIS RESULTS 

To better illustrate the results of our analysis, it is deemed appropriate to 
associate them with the research questions we posed in Chapter I. The two 
questions were: 

• Is reinforcement learning appropriate for use in social simulations 
and the CG model in particular? 

• What are the advantages of the use of utility-based agents within 
the CG model? 

To answer the first question, we tried to show: 

• That the learning agent prototype we developed was functioning 
properly 

• That it was producing the expected results within the simulation. 

To show that the learning agent was functioning properly, all we had to 
examine was whether it was using reinforcement learning for its action decisions. 
We focused on one agent in a specific region and enhanced the utility value of 
one of its candidate actions over all others. Moreover, we made our agent 
greedy, so that it would favor that action over all its other candidate actions. We 
kept our third independent variable, the collection interval, constant for all agents 
in our scenario. The results showed that our agent preferred the action with the 
enhanced utility throughout the duration of the simulation run. When we compared 
it to another insurgent agent in a neighboring region, one that was not greedy 
and did not have an enhanced utility assigned to any of its candidate actions, we 
saw that the difference in the two agents’ action choices was statistically 
significant. This is a strong indication that our learning agent functions using 
the principles of reinforcement learning. 

After performing a design of experiments on the same scenario, we saw 
that our learning agent produces the desired result, namely a decrease in the 
population’s mean stance on security. Moreover, by constructing a contour plot of 
our MOE over lambda and temperature, we had the opportunity to isolate the 
combinations of lambda and temperature that contribute to the best and the 
worst results. A comparative analysis of these two combinations showed that 
they both produce the desired result, but the combination that produces the best 
results does so in a much faster way. A calculation of that difference showed 
that, by choosing the right combination of lambda and temperature, we gain the 
desired results about 12% faster, on average. 

Finally, we performed a design of experiments based on a more general 
scenario, one that treated all agents and their candidate actions in the same 
manner. In that experiment, we examined the impact of all three independent 
variables (lambda, temperature and collection interval) on our chosen MOE. The 
results showed that the collection interval plays the most significant role in 
determining the value of our MOE, far outshining any effects the other two 
variables might have. By further examining the effects of the collection interval, 
we discovered that the best results are produced when the collection interval has 
a high value. This makes intuitive sense because when an agent allows more 
time to elapse between its action choices, its previous action has more time to 
show its effects, and the agent can gather more information from its 
environment in order to make its next decision as good as possible. 

The answer to the second question is not the result of any statistical 
analysis, but stems from the experience we gained by working with our learning 
agents within the CG model. Our conclusions about this research question can 
be summarized in the following: 

• The agents behave in a more realistic way. In the previous 
incarnation of the CG model, the agents performed their actions 
according to a predetermined plan that was hard coded inside the 
scenario file. This fact did not allow the agent to assess the 
situation and act accordingly. No matter what was happening in the 
world around it, the agent would perform its predetermined actions. 
With reinforcement learning, the agent gathers information from its 
environment and uses it to formulate an action plan. By constantly 
reexamining its past actions and their results, the agent evaluates 
all candidate actions and eventually decides on the course of action 
that best serves its interests. 

• The implementation of reinforcement learning in the CG model 
allows for more flexibility for the user. Through a small number of 
parameters, the user can change the way the agents behave and, 
by doing so, explore a potentially infinite number of action 
sequences. Moreover, scenario setup becomes 
considerably easier and faster, since the user does not have to 
create pre-scripted actions for each acting agent. 

• Our experimentation showed that the scenario execution time of the 
CG model with reinforcement learning becomes considerably lower, 
when compared to the time it takes to run a scenario with hard 
coded actions, thus allowing the users to make better use of their 
computer resources by adding more agents in their scenarios and 
modeling situations that are more complex. At first, this might 
sound counterintuitive. How can a scenario in which the agents 
choose and then execute their actions run faster than a scenario 
in which the agents only execute pre-scripted actions, with 
no choosing involved? To answer this question, we must take a 
look at what goes on inside the model during execution time. In the 
older version of the CG model, the agent had to access the model’s 
Bayesian network through Netica. Netica is an application, separate 
from the CG model, but running in parallel, which handles all 
computations involved during the execution of each action as well 
as the updating of the population’s stances, after the execution of 
each action. Netica was called even during scenarios with scripted 
actions to control action selection related to infrastructure objects. 
Each agent within the model possessed a separate behavior 
network for each commodity provided by the infrastructure servers 
as well as any other actions they might have. In the new version of 
the CG model, the necessary computations during the execution of 
each action, as well as all action selections related to infrastructure, 
happen inside the model, thus resulting in considerable gains in 
execution time, since it is no longer necessary to call a separate 
application. 

• In the new version of the CG model, all actions performed by the 
agents can be traced. This facilitates analysis by potential users to 
gain an understanding of the impact of agent actions, facilitating 
course of action analysis. 

C. FUTURE WORK 

There are many potential avenues a researcher could explore, using this 
study as a starting point. Some of these areas are illustrated below: 

• Use of another action choice algorithm: Instead of using the 
softmax algorithm for action selection, it would be interesting to 
utilize another algorithm, for example the ε-greedy algorithm, for the 
action selection process. A comparative analysis with the softmax 
action selection algorithm could prove to be quite interesting. 

• Establishment of a schedule for the gradual decrease of 
temperature over time : In the implementation of the softmax 
algorithm presented in this study, the temperature is being held 
constant throughout the scenario execution. One possible 
opportunity for research could be the creation of a function that 
would decrease the value of temperature as the scenario 
advances. In that way, we could make the agents move from 
exploration to exploitation in a controlled way, similar to that of the 
simulated annealing search algorithm (Russell & Norvig, 2003). 

• Analysis of the impact of reinforcement learning in other 
components of the CG model : In this study, we focused our 
analysis on the impact that reinforcement learning has on the 
population’s stance on security. There are many other components 
in the CG model that could be explored in order to determine the 
impact of reinforcement learning on them. For example, one study 
could examine the impact of reinforcement learning on 
infrastructure. More specifically, how the implementation of 
reinforcement learning in the infrastructure consumption chain 
affects the action choices of the consumers. 

• Allow the agent to use multiple percepts for his utility 
calculation : The implementation presented in this study allows the 
agent to use only one percept for his utility calculation. Although 
this is acceptable as a first step, it cannot be considered as a final 
solution. A more realistic approach would be to allow the agent to 
use multiple percepts for his utility calculation. These percepts 
would be user defined and would apply to all agents of the same 
agent prototype. It should be noted that the infrastructure for this 
implementation is already in place within the CG model. Only minor 
changes in the reinforcement learning algorithm code are required 
for making the proposed approach a reality. 
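
Three of the avenues listed above can be sketched in a few lines. None of this 
code exists in the CG model; the function names, the geometric cooling rule, its 
default parameters and the weighted-sum form for multiple percepts are 
assumptions made only to illustrate the ideas. 

```python
import random


def epsilon_greedy(actions, utilities, epsilon, rng=random):
    """With probability epsilon explore uniformly; otherwise pick the best action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    best = max(range(len(actions)), key=lambda i: utilities[i])
    return actions[best]


def annealed_temperature(step, start=1.0, decay=0.99, floor=0.01):
    """Geometric cooling: start * decay**step, never below `floor`."""
    return max(floor, start * decay ** step)


def combined_utility(percepts, weights):
    """Weighted sum of several percept changes; the weights would be user defined."""
    return sum(weights[name] * value for name, value in percepts.items())
```

Feeding `annealed_temperature(step)` into a softmax selection rule would move an 
agent gradually from exploration to exploitation, while `epsilon_greedy` offers 
an alternative selection rule that could be compared against softmax. 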